Abstract

Email is an increasingly important and ubiquitous means 
of communication, both facilitating contact between pri- 
vate individuals and enabling rises in the productivity of 
organizations. However, the relentless rise of automatic,
unauthorized emails, a.k.a. spam, is eroding much
of the attractiveness of email communication. Most of
the attention dedicated to date to spam detection has fo- 
cused on the content of the emails or on the addresses or 
domains associated with spam senders. Although methods
based on these easily changeable identifiers work
reasonably well, they miss the fundamental nature
of spam as an opportunistic relationship, very different 
from the normal mutual relations between senders and 
recipients of legitimate email. Here we present a com- 
prehensive graph theoretical analysis of email traffic that 
captures these properties quantitatively. We identify sev- 
eral simple metrics that serve both to distinguish between 
spam and legitimate email and to provide a statistical ba- 
sis for models of spam traffic. 



1 Introduction 

Spam is quickly becoming the leading threat to the via- 
bility of email as a means of communication and a lead- 
ing source of fraud and other criminal activity world- 
wide. Much is known about spam traffic. According 
to the Spamhaus project [16], the vast majority of spam
emails presently originate in the USA and China, hosted
by well known ISPs and generated by identified individuals.
Nevertheless, an increased effort in criminal investigation
and waves of high profile legislation have not
yet succeeded at reducing the relentless increase in spam
traffic [10], which now accounts for about 83% of all
incoming emails, up from 24% in January 2003 [12].

It is often said that the problem of spam email is that 
it is an extremely asymmetric threat. While it is techni- 
cally easy and very cheap to send a spam email it requires 
sophisticated organization and much higher costs at the 
receiving end to sort out legitimate emails from junk. 

This asymmetry is of course not directly manifest in
the sender's email address, in the domain he or she uses,
or in the simplest characteristics of the message
(e.g. its size). It is rather a property of structural
relationships - spammers tend to be senders to a socially 
unrelated set of receivers - while legitimate email tends, 
instead, to reflect the variety of mutual personal, profes- 
sional, institutional ties among people. Thus by identi- 
fying the comparative structural and dynamical nature of 
email traffic, we expect to find good discriminators be- 
tween normal email and spam traffic. The goal of this 
work is to present the modeling of email - legitimate and 
spam - traffic as networks, in order to identify graph the- 
oretical metrics that can be used to differentiate between 
the two. We are also interested in providing a unified 
view of several metrics characterizing the relationships 
between senders/recipients and of their evolution for le- 
gitimate and spam traffics in order to formulate, in the 
future, a predictive model of spam dissemination. 

Our study goes beyond several recent analyses [4] [7]
on the graphical nature of spam traffic. We deal with 
a different database, involving a much larger number of 
users and messages, and analyze a wider set of metrics, 
both static and dynamic. We will show that there is no 
single graphical metric that unequivocally distinguishes 
between legitimate and spam email. There are, however,
several graph theoretical measures that can be combined
into a probabilistic spam detection framework. These are 
then identified as candidates for the construction of a fu- 
ture spam filtering algorithm. 

The remainder of this paper is organized as follows.
In Section 2 we introduce the modeling of email traffic in
terms of two graph classes and present the types of metrics
to be studied. Section 3 gives several global properties
of our workload. We evaluate the several metrics,
for each of the two graph classes, in Section 4. In Section 5
we present related work. Finally, we present our
conclusions in Section 6 and discuss open questions left
for future work.

2 Graph-Based Modeling of E-mail 
Workloads 

In order to characterize spam email traffic versus non- 
spam we define two types of graphs: a user graph and a 
domain graph. The vertices of the user graph are email 
senders and recipients present in our log. An email sent 
by A to receiver B results in a link between A and B. The 
domain graph has as vertices the domains of the external 
senders to the local domain being analyzed, and users if 
inside the local domain. Its construction is similar to the 
user graph but sets of users external to the local domain 
who share an external domain are aggregated together 
into a single node. Note also that the domain graph is a 
simpler bipartite graph and not all characteristics studied 
will be present in it. 

The edges of both graphs can take one of four
forms: directed or undirected; binary¹ (or unweighted)
or weighted (e.g. by the number of emails exchanged or
by the total size of the emails exchanged in bytes). These
options cover most of the possibilities for direct graphical
construction out of the email logs at our disposal
(described in Section 3).
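The two graph constructions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(sender, recipients, size)`, the function names, and the local domain `dept.example.edu` are hypothetical and do not come from the logs analyzed here.

```python
from collections import defaultdict

LOCAL_DOMAIN = "dept.example.edu"  # hypothetical local domain, for illustration only


def domain_node(address, local_domain=LOCAL_DOMAIN):
    """Map an address to its domain-graph node.

    Local users stay individual nodes; external users who share a
    domain collapse into a single node named after that domain.
    """
    domain = address.rsplit("@", 1)[-1]
    return address if domain == local_domain else domain


def build_graph(records, node_of=lambda address: address):
    """Return directed weighted edges {(src, dst): [n_emails, n_bytes]}."""
    edges = defaultdict(lambda: [0, 0])
    for sender, recipients, size in records:
        for recipient in recipients:
            edge = edges[(node_of(sender), node_of(recipient))]
            edge[0] += 1       # weight by number of emails
            edge[1] += size    # weight by total bytes
    return dict(edges)


# Toy log records: (sender, list of recipients, size in bytes).
records = [
    ("ann@dept.example.edu", ["bob@other.org"], 1200),
    ("bob@other.org", ["ann@dept.example.edu"], 800),
    ("eve@other.org", ["ann@dept.example.edu"], 500),
]
user_graph = build_graph(records)
domain_graph = build_graph(records, node_of=domain_node)
```

The binary (unweighted) form is simply the set of edge keys, and an undirected form can be obtained by replacing each key `(src, dst)` with `frozenset((src, dst))`.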

The user graph is in principle the most useful in identi- 
fying the individual nature of users as spam or non-spam 
senders. In some cases these characteristics extend to 
the whole external domain (particularly if the spammer
changes the user name² more often than the domain) and the
domain graph produces a useful aggregation of the user 
data. We believe that user graphs will be more effective 
in identifying senders of non-spam since spam senders 
tend to change their full email address very frequently. 

The user or domain graphs can be constructed exclusively
out of spam traffic, non-spam traffic, or the aggregate
set of all emails. Some of the graph theoretical
properties studied below will be analyzed in terms of
the graphs formed when considering the different traffics
separately, while others will be evaluated on selected
nodes from the aggregated traffic. The selected nodes
represent senders in the aggregated graph and can be divided
into two classes, spam and non-spam, based on the
type of emails they send. These classes do not form disjoint
sets (see Table 2). Since we are analyzing nodes in
terms of the email types they send, we will not present
an analysis of the incoming edges (traffic). In other
words, we will not attempt to identify spammers from
the set of emails that are sent to them, simply because
the statistical properties of such messages are clearly less
significant than those of the messages they send out.

¹ If any message was sent from A to B over the observation time, a
link is established.

² The first part of the address, located before the @.

Given these two graph constructions we will analyze 
two types of properties: (i) structural and (ii) dynami- 
cal. The former capture the structure of social relation- 
ships between users exchanging emails, while the latter 
relate to how graphical properties evolve over time. As 
we shall show below there are distinct independent sig- 
natures of spam traffic in both structural and dynamical 
properties. As a consequence they should be taken to- 
gether to generate a better detection procedure. 

3 E-mail Workloads 



Measure            Non-Spam    Spam       Aggregate
# e-mails          336,580     278,522    615,102
Size of e-mails    11.00 GB    1.70 GB    12.71 GB
# sender users     94,985      170,664    263,144
# sender domains   20,414      48,087     59,971
# recipients       26,450      12,867     35,471

Table 1: Workload summary



The construction of the graphs introduced in Section 2
is subject to several practical constraints. Our knowledge 
of email traffic comes from Postfix logs of the central 
SMTP incoming/outgoing servers of an academic de- 
partment from a large University in Brazil. Incoming 
emails only contain the recipients internal to the depart- 
ment's domain. Outgoing emails contain the full list of 
recipients. Moreover our data set does not contain infor- 
mation about emails exchanged between users external 
to the domain. 

The logs were collected between 11/18/2004 and
12/31/2004 and contain the following data for each
email: (i) received time and date; (ii) a reject flag,
indicating whether the connection was rejected during e-mail
acceptance; (iii) size of the email³; (iv) sender address; (v)
list of recipients; and (vi) a spam flag, indicating whether it was
classified as spam by Spam-Assassin [15]. The
logs were sanitized and anonymized to protect the users'
privacy. Statistical characteristics of the workload are in
agreement with previous email traffic analyses [9] [6] [17].
Table 1 summarizes the data set.

³ Only for the accepted emails.

Spam-Assassin [15] is a popular spam filtering software
that detects spam emails based on a changing set
of user-defined rules. These rules assign scores to each 
received e-mail based on the presence in the subject or 
in the e-mail body of one or more pre-categorized key- 
words. Spam-Assassin also uses other rules based on 
email size and encoding. Highly ranked emails, accord- 
ing to these criteria, are flagged as spam. 



4 Spam Networks vs. Legitimate
Email Networks

Type               External            Internal
Spam               169931 (277535)     733 (987)
Non-Spam           93666 (186607)      1319 (186607)
Spam & Non-Spam    2366 (-)            139 (-)
Total              263231 (462142)     1913 (152960)

Table 2: Number of unique email addresses by origin (internal
or external to the domain) and classified as spam,
non-spam or both. Numbers in parentheses indicate the
total number of emails sent by each class.

Although spam emails originate principally from users
outside the local domain, spam senders use several techniques
to falsify or steal local addresses (e.g. crawling
the web for email addresses available at web pages,
network sniffing, name dictionaries). As a result, spam
email does originate from the local domain, both from
real users and from forged ones. This mixing between
regular email users and spam senders can lead to more
complex email networks than might have been naively
expected and poses a challenging problem for detection.

Table 2 summarizes the number of addresses and
emails by node classes and by internal or external origin.
Node classes are as defined in Section 2, plus a third
category, Spam & Non-Spam, which is the intersection
of the former two. The size of this overlap shows the
impact of email address spoofing.

Most emails originate outside the domain. In our log
most outside users are spam senders and account for the
majority of the emails. Because it is very easy for a
spammer to forge an address, spam senders use many
addresses simultaneously and/or frequently switch between
them. This strategy is visible in our database, as
non-spam internal users send many more emails per user
than spam internal users. We expect that this is a general
feature of spam versus non-spam traffic.

The number of spam senders that are internal is very
small. The fraction of these that send exclusively spam
is 81%. These addresses presumably correspond to internal
addresses that have been forged and do not actually
exist⁴. The remaining addresses send both spam and
non-spam and are probably genuine users whose addresses
have been spoofed.

4.1 Structural analysis of spam vs. non-spam
email graphs

Figure 1: Distribution of the node degrees for sender
classes in the aggregated graphs: (a) user graph, (b) domain
graph. Horizontal axis: out-degree.

One of the most common structural measures analyzed
in complex networks is the distribution of the number
of incoming and outgoing node connections, or
degree [14] [13] [2]. Figure 1 shows the distribution of the
out-degrees of the different sender classes for the user
and domain graphs.

The out-degree distributions approximately follow a
power law (C/x^α). Using a simple linear
regression we estimated the exponent α that best models
the data. For the user graph we obtained α = 1.497
(R² = 0.965) for spam senders and α = 1.359 (R² =
0.981) for non-spam senders. We conclude that the spam
senders' out-degree distribution is slightly more skewed.
We conjecture that this is because spammers have a limited
knowledge of the set of users in each specific domain.
Since in our analysis we only observe a fraction
of the spammers' lists (the one composed of the messages
sent to the domain studied), there are no spammers
with recipients' lists as large as those found for non-spam
senders.

⁴ This suggests that a simple, effective way to filter out spam originating
from internal domain addresses is to verify that they correspond
to an existing user.
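A power-law exponent of this kind can be estimated by ordinary least squares on the log-log degree histogram. The sketch below is one plausible reading of "simple linear regression", not necessarily the exact procedure used for the numbers above (binning and the treatment of the tail are assumptions); the function name and the synthetic data are hypothetical.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Fit log P(k) = log C - alpha * log k by least squares.

    `degrees` is a list of positive node degrees with at least two
    distinct values. Returns (alpha, r_squared).
    """
    counts = Counter(degrees)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / total) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared


# Synthetic degrees whose frequencies are exactly proportional to k^-2.
synthetic = []
for k, count in [(1, 3600), (2, 900), (3, 400), (4, 225), (5, 144), (6, 100)]:
    synthetic.extend([k] * count)
alpha, r_squared = fit_power_law(synthetic)
```

On real degree data the sparse tail of the histogram is noisy, so the fitted exponent depends on how the tail is binned or truncated.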

Degrees from 1 to about 20 are much more probable
for spam senders than for non-spammers, while very
large degrees are more likely in non-spam. There is no
difference between the two sender classes in the body of
the distribution, for degrees from about 20 to 400. The
mean out-degrees are 3.56 and 1.63 for non-spam and
spam, respectively (see Table 2).

In the domain graph the out-degree distribution shows 
a much higher probability for nodes with low out-degree 
in spam traffic than in non-spam.

Figure 2: Distribution of Communication Reciprocity

In order to evaluate discrepancies between in and out 
sets of addresses for a given node we create a simple 
metric called Communication Reciprocity (CR) of x as: 



$$CR(x) = \frac{|OS(x) \cap IS(x)|}{|OS(x)|} \qquad (1)$$



where OS(x) is the set of nodes that receive a message 
from node x and IS(x) is the set of addresses that send 
messages to x. With our choice of normalization this
metric measures the probability of a node receiving a response
from each one of its addressees.
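As a minimal sketch, the Communication Reciprocity of a node can be computed directly from the set of directed edges. The function name and the toy edge set are hypothetical; returning `None` for nodes with an empty out set is an assumption, since the ratio is undefined there.

```python
def communication_reciprocity(node, edges):
    """CR(x) = |OS(x) & IS(x)| / |OS(x)| over directed (src, dst) edges."""
    out_set = {dst for src, dst in edges if src == node}
    in_set = {src for src, dst in edges if dst == node}
    if not out_set:
        return None  # undefined: the node never sent a message
    return len(out_set & in_set) / len(out_set)


# "a" and "b" write to each other; "s" writes to both and is never answered.
edges = {("a", "b"), ("b", "a"), ("a", "c"), ("s", "a"), ("s", "b")}
cr_a = communication_reciprocity("a", edges)  # one of two addressees replied
cr_s = communication_reciprocity("s", edges)  # nobody replied
cr_c = communication_reciprocity("c", edges)  # "c" sent nothing
```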



Figure 2 shows the distribution of the Communication
Reciprocity. This metric is able to effectively differen- 
tiate users associated with spam from non-spam. The 
grouping of users in the domain graph makes this dif- 
ferentiation more difficult. However, even in the domain 
graph the difference is very clear. 

The analysis of the communication reciprocity sug- 
gests that a strong signature of spam is its structural im- 
balance between the set of senders and receivers associ- 
ated with a spam sender. However whenever there is an 
imbalance, how many of the unmatched addresses corre- 
spond to spam senders? 

To address this question, let the asymmetry set for a
node be the difference of its in and out sets. Figure 3
shows the number of spam addresses in the asymmetry
set versus the size of the asymmetry set itself. The resulting
relation is very well fit by a straight line at 45°,
showing a strong correlation between the two numbers.
The statistical correlation (slope) is ρ = 0.979 for the user
graph and ρ = 0.998 for the domain graph. So almost
all senders in the asymmetry sets are spammers, regardless
of the graph analyzed. The non-spam data is not
very well modeled by a 45° straight line. These points correspond
to the non-spam senders that were not answered
(or for whom we could not see an answer in our log). The
correlation is ρ = 0.8723 and ρ = 0.9932 for the user
and domain graphs, respectively. As expected from the
result for the spam data, the non-spam data has a higher
correlation for the domain graph.

Figure 3: Number of spams/hams in the asymmetry set
vs. the number of nodes in the asymmetry set
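The quantities plotted in Figure 3 can be sketched as follows. The code assumes that the "difference of its in and out sets" means the in set minus the out set, i.e. the addresses that wrote to a node and were never written to by it; this reading, the function names, and the toy data are assumptions made for illustration.

```python
def in_out_sets(node, edges):
    """Return (IS, OS) for a node over directed (src, dst) edges."""
    in_set = {src for src, dst in edges if dst == node}
    out_set = {dst for src, dst in edges if src == node}
    return in_set, out_set


def asymmetry_profile(node, edges, spam_senders):
    """Return (size of asymmetry set, number of spam senders in it)."""
    in_set, out_set = in_out_sets(node, edges)
    asymmetry = in_set - out_set  # assumed definition: senders never contacted
    return len(asymmetry), len(asymmetry & spam_senders)


# User "u" hears from two spammers, a mailing list, and a friend it answers.
edges = {("s1", "u"), ("s2", "u"), ("list", "u"), ("friend", "u"), ("u", "friend")}
spam_senders = {"s1", "s2"}
profile = asymmetry_profile("u", edges, spam_senders)
```

Collecting one such pair per node gives the scatter whose correlation is reported above.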

This result can be made sharper if we analyze the correlation
between the number of spammers in the incoming
set of a node and spammers in its asymmetry set.
We find ρ = 0.999 and ρ = 0.994 for the user and
domain graphs, respectively. There is a slightly worse
correlation in the domain graph. We conjecture this is
due to the external reliable domains used by spammers
(e.g. through spoofing and forging techniques). These
may not be counted in the asymmetry set since they are
replied to through their legitimate emails, but are part of the
incoming set as spammers.

These results show that spam messages are almost 
never replied to, except in cases of spoofed or forged do- 
mains or users' ids and rarely, we assume, intentionally. 

Asymmetry sets can in principle be used as a compo- 
nent in a probabilistic spam detection mechanism. The 
arrival of an email from a sender that has already been 
contacted by an internal recipient is an indication that it 
has high probability of being a non spam email. 

Another common characteristic of social networks is a
high average clustering coefficient (CC) [8]. The CC of a
node n, denoted C_n, is defined as the probability of any
two of its neighbors being neighbors themselves. This
metric is associated with the number of triangles that contain
the node n. For an undirected graph, the maximum
number of triangles connecting the N_n neighbors of n
is N_n (N_n - 1)/2. Thus, the CC measures the ratio
between actual triangles and their maximal value. During
clustering coefficient analysis we only consider the
nodes with N_n > 1, since this is a necessary condition
for the CC to be nonzero.
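The definition above translates into a short computation on an undirected adjacency structure. This is a minimal sketch: the representation (a dictionary mapping each node to the set of its neighbors) and the function name are assumptions chosen for the example.

```python
from itertools import combinations


def clustering_coefficient(node, neighbors):
    """C_n = (links among the neighbors of n) / (N_n * (N_n - 1) / 2).

    `neighbors` maps each node to the set of its undirected neighbors.
    Returns None when N_n < 2, where the ratio is not defined.
    """
    nbrs = neighbors[node] - {node}
    n_n = len(nbrs)
    if n_n < 2:
        return None
    # Count neighbor pairs that are themselves connected (triangles through node).
    links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
    return links / (n_n * (n_n - 1) / 2)


# Triangle a-b-c with an extra node d attached only to a.
neighbors = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cc_a = clustering_coefficient("a", neighbors)  # 1 of 3 possible neighbor pairs linked
cc_b = clustering_coefficient("b", neighbors)  # its two neighbors are linked
cc_d = clustering_coefficient("d", neighbors)  # only one neighbor
```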




Figure 4: Distribution of the clustering coefficient for the
different classes in the aggregated user graph.

Figure 4 shows the distribution of the CC of nodes in
the aggregated graphs. The clustering coefficient measures
cohesion of communication, not only between two
users but among friends of friends. This is a pervasive
characteristic of social relations that is absent from spam
sender-receiver connections. As a result, regular email
users have higher CC than spam senders. In terms of the
average value, regular email also has a higher value (0.16
against 0.08).

Some recent studies [4] have examined graphical metrics
of the strongly connected components (SCC) of email
graphs. An SCC is a subset of the nodes of a graph such
that any node can be reached from any other node in
the set by following edges between them. A complementary
measure to the CC and SCC is the average path
length between two nodes. The CC and average path
length properties are generally related to the so-called
small world networks, which display high CC (higher
than a random graph with the same connectivity) and
short path length, usually comparable to log N, where
N is the number of nodes in the graph.

In our experiments neither the SCC nor the average path
length has been able to convincingly differentiate
spam from legitimate traffic. All of the graphs studied
are small world networks to some extent. Also, all of the
graphs have giant connected components. Other studies
have used the clustering coefficient of SCCs to identify
spam in networks constructed from the correspondence
of a single user [4]. However, for data from servers that
aggregate the communication between different senders
and receivers we find that these metrics do not suffice to
perform a clear identification of spam.

Another interesting structural characteristic of graphs
is the probability of visiting a node during a random walk
through the graph. At each step of the random walk
we need to select the next node to be visited. This can
be done in two ways. The next node can be randomly
selected from the out set of the current node or we can
perform a jump. For a jump, one of the nodes of the
graph is selected randomly as the next node. Note that
this measure is related to node betweenness⁵, since higher
node betweenness tends to generate a higher probability
of visitation. Nevertheless this probability is much easier
to compute than node betweenness for large graphs. The
probability P(x) of finding a node x in a random walk is
computed iteratively as follows:

$$P(x) = \frac{d}{N} + (1 - d) \sum_{z \in IS(x)} \frac{P(z)}{|OS(z)|} \qquad (2)$$

where d is the probability of performing a jump during
a random walk and N is the number of nodes in the graph.
The parameter d is a damping factor that can be varied.
A value usually used in the literature is 0.15 [5], which is
also the value we use in our measurements.
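The iterative computation of the visiting probability can be sketched as a plain power iteration. The code assumes the update P(x) = d/N + (1 - d) Σ P(z)/|OS(z)| over the nodes z that send to x, which is the standard form for a random walk with jump probability d; the handling of nodes without outgoing edges (the walker always jumps) and the fixed iteration count are assumptions of this sketch, not details taken from the measurements reported here.

```python
def visit_probability(out_sets, d=0.15, iterations=50):
    """Random-walk visiting probability with jump probability d.

    `out_sets` maps each node to the set of nodes it sends to.
    """
    nodes = set(out_sets) | {v for targets in out_sets.values() for v in targets}
    n = len(nodes)
    p = {x: 1.0 / n for x in nodes}
    for _ in range(iterations):
        nxt = {x: d / n for x in nodes}          # contribution of random jumps
        for z in nodes:
            targets = out_sets.get(z, ())
            if targets:
                share = (1 - d) * p[z] / len(targets)
                for x in targets:
                    nxt[x] += share
            else:
                # Assumption: a node with no out set forces a jump.
                for x in nodes:
                    nxt[x] += (1 - d) * p[z] / n
        p = nxt
    return p


# "a" and "b" correspond with each other; "s" mails both and is never answered.
out_sets = {"a": {"b"}, "b": {"a"}, "s": {"a", "b"}}
p = visit_probability(out_sets)
```

In this toy graph the unanswered sender "s" is reached only through jumps, so it ends up with the smallest visiting probability.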

The results are shown in Figure 5. The difference between
spam and non-spam behavior is less noticeable in
the domain graph than in the user graph. Spam nodes 
show generally lower probabilities of being visited, as 
might have been expected because of the asymmetry of 
their communication. Visiting probabilities for spam 
nodes in the user graph are localized to the initial and 
final parts of the distribution and are less pronounced in 
the middle range. 

The node visitation probability distributions can be
modeled by a power law. We estimate the corresponding
exponent at α = 0.694, 1.097 and 0.975 for the non-spam
component of the user graph, and for the non-spam
and spam components in the domain graph, respectively.
The R² associated with the fits varies between 0.959 and
0.998. The R² for the spam curve of the user graph is
0.853, showing that it is not well modeled by a power
law, as visual inspection suggests.

⁵ The number of shortest paths that pass through a node.

Figure 5: Distribution of the probability of finding a node
during a random walk.

Figure 6: Graph evolution by percentage of messages.



4.2 Dynamical analysis 

Beyond the structural characteristics of the graphs of 
spam and non-spam email other metrics related to the dy- 
namics of communication and graph evolution may help 
model spam traffic. 

A large amount of effort has been devoted recently to 
creating realistic growth models for complex networks. 
One of the key characteristics of such models is the evo- 
lution of the number of nodes and edges, as well as the 
probabilistic connection rules for the new nodes to those 
already in the graph. Figure 6 shows the evolution of the
graph in terms of number of nodes and edges. We plot
these quantities against the percentage of messages evaluated
for each graph, to avoid the influence of the rate of
message arrival, which varies with time depending on the
type of the traffic being considered (e.g. the bell shaped
behavior for the non-spam traffic against the almost constant
rate for spam traffic [9] [3] [6]).



The growth of the aggregated graph (a composition 
of the spam and the non-spam graphs) results from the 
growth in both the spam and non-spam components. The 
spam subgraph is a much more rapidly growing struc- 
ture. Over the time of the log we find no saturation ef- 
fect in these numbers. Instead the number of addresses 
and edges grows almost linearly with the number of 
emails. An eventual saturation in the non-spam compo- 
nent might be expected for longer times. 

Another important dynamical graph characteristic is
how the weights of edges evolve, i.e. how the flow
of information between nodes varies over time. An
interesting metric that can be used to measure this is
the stack distance of connected pairs in terms of
the emails they exchange over time. The stack distance
measures the number of distinct references between
two consecutive instances of the same object in
a stream. We take the total email log as the stream and
each sender/receiver pair as the object. The ordering of
sender and receiver is disregarded. Figure 7 shows the pairs'
stack distance distributions. We see that temporal locality
is much stronger in non-spam traffic. This means in
practice that legitimate users exchange emails in bursts
concentrated over short periods of time.
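The stack distance of sender/receiver pairs can be sketched with a simple LRU stack. The function name and the toy log are hypothetical; pairs are stored as `frozenset` objects so that the ordering of sender and receiver is disregarded, as described above. The list-based stack is quadratic in the worst case and is meant only to make the definition concrete.

```python
def stack_distances(stream):
    """Number of distinct objects seen between consecutive references
    to the same object, for every repeated reference in the stream."""
    stack = []       # least recently used first, most recently used last
    distances = []
    for obj in stream:
        if obj in stack:
            index = stack.index(obj)
            distances.append(len(stack) - 1 - index)
            stack.pop(index)
        stack.append(obj)
    return distances


# Toy log of (sender, receiver) events in arrival order.
log = [("a", "b"), ("c", "d"), ("b", "a"), ("e", "f"), ("c", "d"), ("a", "b")]
pairs = [frozenset(event) for event in log]
distances = stack_distances(pairs)
```

Small distances indicate strong temporal locality: the same pair is seen again before many other pairs have been active.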

We were also interested in studying how the nodes
communicate with their peers in terms of the number of
messages. Because of the impersonal nature of spam we
expect that spam senders communicate in a more structured
way with their recipients. Not only will legitimate
senders show more variation in the number of messages
they send to each person in their out sets, they will also
show variability of the messages themselves in terms of
their sizes. In order to quantify these effects we evaluated
the normalized entropy of the in and out flows for
each node, defined as

$$H(x) = \frac{\sum_{y \in OS(x)} -p(y) \log(p(y))}{\log(|S(x)|)} \qquad (3)$$

where p(y) is the probability of y receiving a message
from x and |S(x)| is the number of unique elements
in the set being considered.

Figure 7: Distribution of stack distances for the pairs in
the different traffics.

Figure 8: Distribution of entropy of the number of messages
in the flow of e-mails for the aggregated graph.

Figure 8 shows the normalized entropy for the out flow
of the nodes in the different sender classes for the ag- 
gregated graphs. As expected, spammers communicate 
with their recipients with much less variability (higher 
entropy). A similar analysis was conducted considering 
the bytes that each node sends with similar results. 
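The normalized entropy of a node's out flow can be sketched as follows, taking p(y) as the fraction of the node's messages that go to recipient y. The function name and the example counts are hypothetical, and returning `None` for fewer than two recipients is an assumption, since the normalizer log(|S(x)|) is zero there.

```python
import math


def normalized_entropy(message_counts):
    """H = -sum p(y) log p(y) / log |S|, with p(y) the share of messages to y.

    `message_counts` maps each recipient to a positive message count.
    """
    n = len(message_counts)
    if n < 2:
        return None  # log(1) = 0, so the normalized value is undefined
    total = sum(message_counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in message_counts.values())
    return entropy / math.log(n)


# A sender who mails every recipient equally often versus one with a clear favorite.
uniform = normalized_entropy({"b": 1, "c": 1, "d": 1, "e": 1})
skewed = normalized_entropy({"b": 97, "c": 1, "d": 1, "e": 1})
```

A sender that treats all recipients alike, as a spammer would, sits at the maximum of 1, while uneven personal correspondence gives lower values. Replacing message counts with byte counts gives the size-based variant mentioned above.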

5 Related Work 

Several studies have recently analyzed the statistical
properties of email workloads [6] [11] [9] [3] [17]. These
studies consider the messages as a flow and study metrics
such as inter-arrival times, e-mail sizes, and number
of recipients per e-mail. Although spam and legitimate
email show differences in terms of these metrics, little has
been done about using them to filter out spam. The work
of the present manuscript takes a different tack by creating
a graph theoretical higher level representation of
email traffic and attempting to differentiate spam from
legitimate email in this abstraction. We believe that this
approach, based on graph theoretical metrics, proves to
be much better suited to the filtering problem.

Other recent papers have focused on models of email
traffic as graphs [4] [7]. For example, in Ref. [4] a graph
is created representing the email traffic captured by the
mailbox of an individual user. The subsequent structural
analysis is based on the fact that such a network possesses
several disconnected components. The clustering
coefficient of each of these components is then used to
characterize messages as spam or non-spam. Their results
show that 53% of the messages were classified using
the proposed approach, with 100% accuracy
in this subset. Our graphs are based on a different
type of dataset, i.e. the logs of SMTP servers,
and as such do not take the perspective of the individual
user. As a result, for our data set the approach proposed
in [4] cannot be used successfully, since there is a giant
SCC in all of the graphs shown. In [7] the authors
detect machines that behave as spam
senders by analyzing a border flow graph of sender and
recipient machines. Moreover, they analyzed the evolving
graph structures over a period of time, based on a
single metric using the HITS algorithm. Our workload
differs from theirs since we do not have access to the
underlying overlay network formed by email relays.
6 Conclusions

In this paper, we have shown that legitimate and spam
email graphs differ in two fundamental classes of char- 
acteristics: structural, which capture the graphs' archi- 
tecture, and dynamical, concerning node communication 
and graph evolution. 

Structurally, we showed that spam and non-spam subgraphs
are characterized by different distributions of the
clustering coefficient of their nodes. Legitimate email 
users display on average higher clustering coefficients 
than spam senders. Node visitation probability is a mea- 
sure of the centrality of a node relative to other nodes 
in the graph. Legitimate email nodes have higher visita- 
tion probability than spam nodes. We also defined a new 
metric called communication reciprocity. It measures the 
probability that a node receives a response from any of 
its addressees. There is a strong difference in the prob- 
ability distributions of the communication reciprocity in 
the legitimate and spam graphs; legitimate nodes have a 
much higher probability of being responded to. Another 
metric introduced in this paper is the email asymmetry 
set, which represents the difference between the sets of 
in and out edges of a node. We showed that there is a 
strong correlation between the size of asymmetry sets 
and the number of spammers in the set. Dynamically 
the spam graph grows much faster than the legitimate 
email graph. The legitimate email graph grows more 
slowly both in the number of nodes and edges, manifest- 
ing the higher stability of relations in a social group. Two 
other dynamical metrics, entropy and stack distance, are 
used to reveal the temporal characteristics of communi- 
cation among nodes. Spam nodes display a much higher 
entropy than legitimate email users, and a much longer 
stack distance. 

We have shown that differences in both classes of 
graph characteristics can be explained by the same hy- 
pothesis, namely that legitimate email graphs reflect real 
social networks, while spam graphs are technological 
networks, devoid of a sense of community. Although no 
single metric can unequivocally differentiate legitimate 
emails from spam, the combination of several graphical 
measures paints a clear picture of the processes whereby
legitimate and spam email are created. For this reason
they can be used to augment the effectiveness of mechanisms
to detect illegitimate emails.